Fit an appropriate ARIMA model for ONE of the following data sets.
a. Tesla stock: adjusted closing prices (use the quantmod package to pull the full history from Yahoo Finance).
b. Total unemployment rate: download the data from 1949-01-01 to 2024-01-01 from https://fred.stlouisfed.org/graph/?g=8ej8 (Unemployment Rate, UNRATE).
library(quantmod)
library(ggplot2)
library(dplyr)
getSymbols("TSLA", src = "yahoo", from = "2010-01-01", to = "2024-01-01")
## [1] "TSLA"
tsla_df <- data.frame(Date = index(TSLA), Price = as.numeric(TSLA$TSLA.Adjusted))
head(tsla_df)
##         Date    Price
## 1 2010-06-29 1.592667
## 2 2010-06-30 1.588667
## 3 2010-07-01 1.464000
## 4 2010-07-02 1.280000
## 5 2010-07-06 1.074000
## 6 2010-07-07 1.053333
ggplot(tsla_df, aes(x = Date, y = Price)) +
geom_line(color = "steelblue") +
labs(title = "Tesla Adjusted Closing Prices", x = "Year", y = "Stock Price") +
theme_minimal()
Tesla's stock price shows little fluctuation before 2020; from 2020 it rises steeply through 2021 before falling back in 2022. Plausible reasons for the drop include the aftermath of COVID-19 and growing competition from Chinese EV makers.
The graph shows very different behavior before and after 2020, which means the variance is not constant over time; a log transformation may therefore be needed.
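As a quick illustration (toy numbers, not the Tesla data) of why the log helps when growth is multiplicative: a constant-percentage growth path becomes a straight line after logging, so its first differences become constant.

```r
# Toy example: 1% multiplicative growth per step (hypothetical data).
x <- 100 * 1.01^(0:99)

# Raw first differences grow over time (non-constant level/variance)...
range(diff(x))

# ...but differences of the logged series are constant at log(1.01).
range(diff(log(x)))
```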
tsla_df$Log_Price <- log(tsla_df$Price)
ggplot(tsla_df, aes(x = Date, y = Log_Price)) +
geom_line(color = "darkred") +
labs(title = "Log Transformed Tesla Stock Prices", x = "Year", y = "Log Price") +
theme_minimal()
Get the lag plots and interpret.
library(ggplot2)
library(dplyr)
library(tidyr) # needed for pivot_longer()
tsla_df_lags <- tsla_df %>%
mutate(
lag_1 = lag(Price, 1),
lag_2 = lag(Price, 2),
lag_3 = lag(Price, 3),
lag_4 = lag(Price, 4),
lag_5 = lag(Price, 5),
lag_6 = lag(Price, 6),
lag_7 = lag(Price, 7),
lag_8 = lag(Price, 8),
lag_9 = lag(Price, 9)
)
tsla_df_long <- tsla_df_lags %>%
pivot_longer(cols = starts_with("lag_"), names_to = "Lag", values_to = "Lagged_Price")
ggplot(tsla_df_long, aes(x = Lagged_Price, y = Price)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
facet_wrap(~Lag, scales = "free") +
labs(
title = "Lag Plots for Tesla Stock Prices",
x = "Lagged Price",
y = "Current Price"
) +
theme_minimal()
The lag plots show a strong linear relationship between Tesla's stock price and its past values at every lag, indicating high autocorrelation. This suggests the series follows a persistent trend rather than behaving like independent noise.
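The persistence seen in the lag plots can be reproduced with a simulated random walk (a base-R sketch, not the Tesla data): successive values of such a series are almost perfectly correlated.

```r
set.seed(42)
# A random walk: each value is the previous one plus Gaussian noise.
rw <- cumsum(rnorm(1000))

# The lag-1 correlation is close to 1, as in the Tesla lag plots.
cor(rw[-1], rw[-length(rw)])
```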
Follow these steps.
par(mfrow = c(2,1))
acf(diff(tsla_df$Price), main = "ACF of Differenced Tesla Prices")
pacf(diff(tsla_df$Price), main = "PACF of Differenced Tesla Prices")
library(tseries)
adf.test(diff(tsla_df$Price))
## Warning in adf.test(diff(tsla_df$Price)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(tsla_df$Price)
## Dickey-Fuller = -14.991, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
From the ADF test above, the differenced series is stationary (p ≈ 0.01, rejecting the unit-root null), so no second differencing is needed.
Since we took one difference, d = 1. The ACF does not show a clear cutoff, so a strong MA component is unlikely; q could be 0 or 1. The PACF likewise declines gradually rather than cutting off sharply, so we try low-order AR terms, p = 1 to 3.
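For reference, the textbook ACF/PACF signatures can be checked on simulated data (a sketch using stats::arima.sim, not the Tesla series): an AR(1) process shows a geometrically decaying ACF and a PACF that cuts off after lag 1.

```r
set.seed(1)
# Simulate 2000 observations from an AR(1) with phi = 0.7.
sim <- arima.sim(model = list(ar = 0.7), n = 2000)

acf_vals  <- acf(sim, plot = FALSE)$acf
pacf_vals <- pacf(sim, plot = FALSE)$acf

acf_vals[2:4]   # lags 1-3: decays roughly geometrically (theory: 0.7, 0.49, 0.34)
pacf_vals[1:3]  # large at lag 1, near zero afterwards
```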
Potential models:
ARIMA(1,1,0)
ARIMA(2,1,0)
ARIMA(3,1,0)
The potential choices have been listed above in part (c).
manual search, auto.arima() and fpp3 package code.
library(forecast)
library(quantmod)
library(forecast)
getSymbols("TSLA", src = "yahoo", from = "2010-01-01")
## [1] "TSLA"
ts_tesla <- ts(Cl(TSLA), frequency = 252) # note: Cl() gives the raw close; Ad() would give adjusted prices
model_110 <- arima(ts_tesla, order = c(1,1,0)) # ARIMA(1,1,0)
model_210 <- arima(ts_tesla, order = c(2,1,0)) # ARIMA(2,1,0)
model_310 <- arima(ts_tesla, order = c(3,1,0)) # ARIMA(3,1,0)
AIC(model_110, model_210, model_310)
##           df      AIC
## model_110  2 22770.44
## model_210  3 22771.92
## model_310  4 22773.80
forecast for the best model you choose.
From the AIC results, we can see that model_110 (ARIMA(1,1,0)) has the lowest AIC and is the best so far.
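The AIC used to rank these models is simply AIC = -2 log L + 2k, where k counts the estimated parameters. This can be verified with base R's arima() on any series (a generic sketch on simulated data, not the Tesla prices).

```r
set.seed(7)
y <- arima.sim(model = list(ar = 0.5), n = 300)

fit <- arima(y, order = c(1, 0, 0))
ll  <- logLik(fit)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
manual_aic <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
c(manual = manual_aic, builtin = AIC(fit))  # identical
```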
library(forecast)
library(ggplot2)
best_model <- arima(ts_tesla, order = c(1, 1, 0)) # best model
forecast_best <- forecast(best_model, h = 30)
autoplot(forecast_best) +
labs(title = "Tesla Stock Price Forecast - ARIMA(1,1,0)",
x = "Time",
y = "Stock Price") +
theme_minimal()
library(forecast)
library(Metrics) # provides mae() and mse()
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:forecast':
##
## accuracy
train_size <- round(length(ts_tesla) * 0.8)
train_data <- ts_tesla[1:train_size]
test_data <- ts_tesla[(train_size + 1):length(ts_tesla)]
best_model <- arima(train_data, order = c(1, 1, 0))
forecast_best <- forecast(best_model, h = length(test_data))
arma_mae <- mae(test_data, forecast_best$mean)
arma_mse <- mse(test_data, forecast_best$mean)
naive_forecast <- naive(train_data, h = length(test_data))
naive_mae <- mae(test_data, naive_forecast$mean)
naive_mse <- mse(test_data, naive_forecast$mean)
ma_forecast <- ma(train_data, order = 5) # centered MA: the last few values are NA
ma_last <- tail(na.omit(ma_forecast), 1) # use the last non-NA smoothed value
ma_mae <- mae(test_data, rep(ma_last, length(test_data)))
ma_mse <- mse(test_data, rep(ma_last, length(test_data)))
results <- data.frame(
Model = c("ARIMA(1,1,0)", "Naïve (Random Walk)", "Moving Average"),
MAE = c(arma_mae, naive_mae, ma_mae),
MSE = c(arma_mse, naive_mse, ma_mse)
)
print(results)
##                 Model      MAE      MSE
## 1        ARIMA(1,1,0) 66.73350 6077.508
## 2 Naïve (Random Walk) 66.61782 6061.712
## 3      Moving Average       NA       NA
In fact the naïve random-walk forecast has slightly lower MAE and MSE than ARIMA(1,1,0) here. This is unsurprising: with a small AR coefficient, ARIMA(1,1,0) is nearly a random walk, so the two forecasts almost coincide. The moving-average row is NA because the centered ma() smoother produces NA values at the end of the training series.
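The two error metrics behind this table are plain averages; a base-R sketch (hypothetical numbers, not the Tesla results) makes the definitions explicit.

```r
# Hypothetical actual and forecast values.
actual    <- c(10, 12, 11, 13)
predicted <- c(11, 11, 12, 12)

mae_val <- mean(abs(actual - predicted))  # mean absolute error
mse_val <- mean((actual - predicted)^2)   # mean squared error
c(MAE = mae_val, MSE = mse_val)           # MAE = 1, MSE = 1
```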
prophet model. Compare?
library(prophet)
tsla_df <- data.frame(Date = index(TSLA), Price = as.numeric(TSLA$TSLA.Adjusted))
df_prophet <- tsla_df %>%
rename(ds = Date, y = Price) %>%
arrange(ds)
m <- prophet(df_prophet)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)
plot(m, forecast) +
ggtitle("Tesla Stock Price Forecast using Prophet") +
xlab("Year") +
ylab("Stock Price (USD)")
Please use the Ice Cream Production (icecreamproduction.csv) data set. This is ice cream production for the United States, downloaded from https://fred.stlouisfed.org/series/M0174BUSM411NNBR
These data are for factory production only. Source: United States Department of Agriculture, "Production and Consumption of Manufactured Dairy Products" (Technical Bulletin No. 722, April 1940), p. 71, and direct from the Bureau of Agricultural Economics.
df <- read.csv("icecreamproduction.csv") # filename from the assignment
str(df)
## 'data.frame': 276 obs. of 2 variables:
## $ observation_date: chr "1918-01-01" "1918-02-01" "1918-03-01" "1918-04-01" ...
## $ M0174BUSM411NNBR: num 2.93 3.22 6.2 8.3 15.22 ...
df <- df %>%
rename(Date = observation_date, Production = M0174BUSM411NNBR) %>%
mutate(Date = as.Date(Date, format = "%Y-%m-%d")) # Convert to Date format
head(df)
##         Date Production
## 1 1918-01-01 2.93
## 2 1918-02-01 3.22
## 3 1918-03-01 6.20
## 4 1918-04-01 8.30
## 5 1918-05-01 15.22
## 6 1918-06-01 19.64
Plot the data and comment.
Check for seasonality: Get the lag plots and try
decomposing (just plotting is enough) to check for seasonality. Look at
ACF plot to check for seasonal correlation. What do you observe? Also,
comment about stationarity.
ts_icecream <- ts(df$Production, start = c(1918, 1), frequency = 12)
plot(ts_icecream, main = "Ice Cream Production Over Time", ylab = "Production", xlab = "Year")
df <- df %>%
mutate(Log_Production = log(Production))
ggplot(df, aes(x = Date, y = Log_Production)) +
geom_line(color = "darkred", linewidth = 1) + # `size` for lines is deprecated since ggplot2 3.4.0
theme_minimal() +
labs(title = "Log-Transformed Ice Cream Production Over Time",
x = "Year",
y = "Log(Production)")
library(ggplot2)
library(dplyr)
library(gridExtra)
lags <- 1:9
plots <- lapply(lags, function(l) {
df %>%
mutate(lagged_value = lag(Production, n = l)) %>%
ggplot(aes(x = lagged_value, y = Production)) +
geom_point(alpha = 0.5, color = "black") +
geom_smooth(method = "lm", color = "red", linetype = "dashed") +
labs(title = paste("Lag", l), x = "Lagged Production", y = "Current Production") +
theme_minimal()
})
grid.arrange(grobs = plots, ncol = 3)
The plots show clear seasonality, which matches common sense: ice cream production peaks in summer, when temperatures are high. However, the series is non-stationary, for two main reasons: (1) the ACF decays only slowly toward 0, and (2) the lag plots do not show constant correlation and variance. We therefore need to difference the series.
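Seasonal (lag-12) differencing removes a repeating monthly pattern exactly; a minimal base-R sketch on an artificial monthly series (not the ice cream data) shows the mechanics.

```r
# Artificial monthly series: a fixed seasonal pattern plus a linear trend.
seasonal <- sin(2 * pi * (1:12) / 12) * 10
x <- rep(seasonal, times = 5) + 0.5 * (1:60)

# Lag-12 differences cancel the seasonal pattern entirely, leaving only
# the constant 12-month trend increment 0.5 * 12 = 6.
d12 <- diff(x, lag = 12)
unique(round(d12, 10))  # every value is 6
```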
library(forecast)
library(tseries)
ts_diff1 <- diff(ts_icecream, differences = 1)  # regular (lag-1) difference
adf.test(ts_diff1)
## Warning in adf.test(ts_diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: ts_diff1
## Dickey-Fuller = -15.197, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
ts_diff_seasonal <- diff(ts_icecream, lag = 12) # seasonal (lag-12) difference
adf.test(ts_diff_seasonal)
##
## Augmented Dickey-Fuller Test
##
## data: ts_diff_seasonal
## Dickey-Fuller = -3.6124, Lag order = 6, p-value = 0.03241
## alternative hypothesis: stationary
ts_diff_both <- diff(ts_diff1, lag = 12)        # regular then seasonal difference
adf.test(ts_diff_both)
## Warning in adf.test(ts_diff_both): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: ts_diff_both
## Dickey-Fuller = -9.3624, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
Since we took one regular difference, d = 1. The PACF cuts off after lag 1, suggesting an AR(1) component, so p = 1. The ACF shows no significant spikes beyond lag 1, which suggests MA(1) or no MA component, so q = 0 or q = 1.
Possible models: - ARIMA(1,1,1) - ARIMA(1,1,0)
library(forecast)
# note: ts_diff_both is already twice-differenced, so d = 1 here applies a further difference
model_110 <- Arima(ts_diff_both, order = c(1,1,0))
model_111 <- Arima(ts_diff_both, order = c(1,1,1))
model_210 <- Arima(ts_diff_both, order = c(2,1,0))
model_211 <- Arima(ts_diff_both, order = c(2,1,1))
calculate_AICc <- function(model) {
k <- length(model$coef) + 1
n <- length(model$residuals)
AICc <- AIC(model) + (2 * k * (k + 1)) / (n - k - 1)
return(AICc)
}
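The small-sample correction in calculate_AICc vanishes as n grows, which is why AIC and AICc rankings usually agree for long series. The correction term 2k(k+1)/(n-k-1) can be checked directly (plain arithmetic, no fitted model needed).

```r
# AICc correction term as a function of sample size, for k parameters.
aicc_correction <- function(k, n) (2 * k * (k + 1)) / (n - k - 1)

aicc_correction(5, 30)    # noticeable for short series: 2.5
aicc_correction(5, 3000)  # negligible for long series: ~0.02
```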
model_comparison <- data.frame(
Model = c("ARIMA(1,1,0)", "ARIMA(1,1,1)", "ARIMA(2,1,0)", "ARIMA(2,1,1)"),
AIC = c(AIC(model_110), AIC(model_111), AIC(model_210), AIC(model_211)),
BIC = c(BIC(model_110), BIC(model_111), BIC(model_210), BIC(model_211)),
AICc = c(calculate_AICc(model_110), calculate_AICc(model_111),
calculate_AICc(model_210), calculate_AICc(model_211))
)
model_comparison <- model_comparison[order(model_comparison$AIC),]
print(model_comparison)
##          Model      AIC      BIC     AICc
## 4 ARIMA(2,1,1) 1159.696 1173.970 1159.851
## 2 ARIMA(1,1,1) 1185.817 1196.522 1185.909
## 3 ARIMA(2,1,0) 1276.412 1287.117 1276.505
## 1 ARIMA(1,1,0) 1383.975 1391.111 1384.021
auto_model <- auto.arima(ts_diff_both, seasonal = FALSE, stepwise = FALSE, approximation = FALSE)
summary(auto_model)
## Series: ts_diff_both
## ARIMA(1,0,4) with zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4
## 0.4036 -1.1108 0.3668 0.1217 -0.2694
## s.e. 0.1544 0.1503 0.1437 0.1016 0.0753
##
## sigma^2 = 4.398: log likelihood = -565.99
## AIC=1143.99 AICc=1144.32 BIC=1165.42
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.005606498 2.077162 1.554979 -Inf Inf 0.4914622 0.00179376
library(fpp3)
df_tsibble <- ts_diff_both %>%
as_tsibble()
model_fpp3 <- df_tsibble %>%
model(ARIMA(value ~ pdq(3,1,1)))
report(model_fpp3)
## Series: value
## Model: NULL model
## NULL model
The NULL model output means the fpp3 ARIMA() call failed to fit the requested pdq(3,1,1) specification to this (already twice-differenced) series.
Choose different models by manual search, auto.arima() and fpp3 package code.
Surprisingly, auto.arima's choice is noticeably better: p = 1, d = 0, q = 4 (d = 0 because ts_diff_both is already differenced). ARIMA(1,0,4) is the best candidate, since it has the lowest AIC (1143.99) and BIC (1165.42) of all the models tried.
library(forecast)
library(Metrics)
arima_model <- Arima(ts_diff_both, order = c(1,0,4))
naive_model <- naive(ts_diff_both, h = 12)
ma_model <- ma(ts_diff_both, order = 3)
arima_forecast <- forecast(arima_model, h = 12)
naive_forecast <- naive_model
mae_arima <- mae(ts_diff_both, fitted(arima_model))
mse_arima <- mse(ts_diff_both, fitted(arima_model))
# fitted(naive_model)[1] is NA, so drop the first observation when scoring
mae_naive <- mae(ts_diff_both[-1], fitted(naive_model)[-1])
mse_naive <- mse(ts_diff_both[-1], fitted(naive_model)[-1])
if (!is.null(ma_model) && length(ma_model) == length(ts_diff_both)) {
mae_ma <- mae(ts_diff_both, ma_model)
mse_ma <- mse(ts_diff_both, ma_model)
} else {
mae_ma <- NA
mse_ma <- NA
}
benchmark_results <- data.frame(
Model = c("ARIMA(1,0,4)", "Naïve (Random Walk)", "Moving Average"),
MAE = c(mae_arima, mae_naive, mae_ma),
MSE = c(mse_arima, mse_naive, mse_ma)
)
print(benchmark_results)
##                 Model      MAE      MSE
## 1 ARIMA(1,0,4) 1.554981 4.314601
## 2 Naïve (Random Walk) NA NA
## 3 Moving Average NA NA
ARIMA(1,0,4) is the only model with valid in-sample error metrics here; the naïve and moving-average rows are NA because their fitted/smoothed series contain NA values, so this comparison is only suggestive rather than conclusive.
library(forecast)
best_model <- Arima(ts_diff_both, order = c(1,0,4))
forecast_horizon <- 36
forecast_values <- forecast(best_model, h = forecast_horizon)
autoplot(forecast_values) +
ggtitle("Forecast of Ice Cream Production for Next 3 Years") +
xlab("Year") +
ylab("Production") +
theme_minimal()
forecast_values
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 1941 3.497030e-01 -2.343182 3.042587 -3.768708 4.468114
## Feb 1941 1.002894e-01 -3.198024 3.398603 -4.944045 5.144624
## Mar 1941 -2.076544e-01 -3.513240 3.097931 -5.263111 4.847802
## Apr 1941 -2.014633e-01 -3.533130 3.130203 -5.296807 4.893881
## May 1941 -8.116005e-02 -3.459144 3.296824 -5.247341 5.085021
## Jun 1941 -3.259883e-02 -3.418070 3.352872 -5.210230 5.145033
## Jul 1941 -1.299677e-02 -3.399686 3.373693 -5.192492 5.166498
## Aug 1941 -5.084272e-03 -3.391972 3.381804 -5.184883 5.174714
## Sep 1941 -1.890339e-03 -3.388811 3.385030 -5.181738 5.177958
## Oct 1941 -6.010873e-04 -3.387527 3.386325 -5.180457 5.179255
## Nov 1941 -8.067212e-05 -3.387007 3.386846 -5.179938 5.179777
## Dec 1941 1.293969e-04 -3.386797 3.387056 -5.179728 5.179987
## Jan 1942 2.141926e-04 -3.386713 3.387141 -5.179643 5.180072
## Feb 1942 2.484210e-04 -3.386678 3.387175 -5.179609 5.180106
## Mar 1942 2.622375e-04 -3.386665 3.387189 -5.179595 5.180120
## Apr 1942 2.678146e-04 -3.386659 3.387195 -5.179590 5.180125
## May 1942 2.700659e-04 -3.386657 3.387197 -5.179587 5.180128
## Jun 1942 2.709746e-04 -3.386656 3.387198 -5.179586 5.180128
## Jul 1942 2.713414e-04 -3.386655 3.387198 -5.179586 5.180129
## Aug 1942 2.714895e-04 -3.386655 3.387198 -5.179586 5.180129
## Sep 1942 2.715492e-04 -3.386655 3.387198 -5.179586 5.180129
## Oct 1942 2.715734e-04 -3.386655 3.387198 -5.179586 5.180129
## Nov 1942 2.715831e-04 -3.386655 3.387198 -5.179586 5.180129
## Dec 1942 2.715870e-04 -3.386655 3.387198 -5.179586 5.180129
## Jan 1943 2.715886e-04 -3.386655 3.387198 -5.179586 5.180129
## Feb 1943 2.715893e-04 -3.386655 3.387198 -5.179586 5.180129
## Mar 1943 2.715895e-04 -3.386655 3.387198 -5.179586 5.180129
## Apr 1943 2.715896e-04 -3.386655 3.387198 -5.179586 5.180129
## May 1943 2.715897e-04 -3.386655 3.387198 -5.179586 5.180129
## Jun 1943 2.715897e-04 -3.386655 3.387198 -5.179586 5.180129
## Jul 1943 2.715897e-04 -3.386655 3.387198 -5.179586 5.180129
## Aug 1943 2.715897e-04 -3.386655 3.387198 -5.179586 5.180129
## Sep 1943 2.715897e-04 -3.386655 3.387198 -5.179586 5.180129
## Oct 1943 2.715897e-04 -3.386655 3.387198 -5.179586 5.180129
## Nov 1943 2.715897e-04 -3.386655 3.387198 -5.179586 5.180129
## Dec 1943 2.715897e-04 -3.386655 3.387198 -5.179586 5.180129
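Because the model was fit to the twice-differenced series, the forecasts above are forecasted differences (hence they hover near zero), not production levels. To report levels, the differencing must be inverted; base R's stats::diffinv does this, as sketched below on a toy series (for the ice cream data the lag-12 and lag-1 differences would need to be inverted in reverse order).

```r
# Toy series standing in for a production history.
x <- c(5, 7, 6, 9, 12, 11)

# First-difference it, as done before modeling.
dx <- diff(x)

# diffinv() undoes the differencing, given the initial value xi.
x_rebuilt <- diffinv(dx, xi = x[1])
all.equal(x_rebuilt, x)  # TRUE: the original levels are recovered
```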